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Abstract. This paper presents faster implementations of the lattice- 
based schemes Dilithium and Kyber on the Cortex-M4. Dilithium is one of 
three signature finalists in the NIST post-quantum project (NIST PQC), 
while Kyber is one of four key-encapsulation mechanism (KEM) finalists. 
Our optimizations affect the core polynomial arithmetic involving 
number-theoretic transforms in both schemes. Our main contributions 
are threefold: We present a faster signed Barrett reduction for Kyber, 
propose to switch to a smaller prime modulus for the polynomial multipli- 
cations cs; and cs in the signing procedure of Dilithium, and apply var- 
ious known optimizations to the polynomial arithmetic in both schemes. 
Using a smaller prime modulus is particularly interesting as it allows 
using the Fermat number transform resulting in especially fast code. 
We outperform the state-of-the-art for both Dilithium and Kyber. For 
Dilithium, our NTT and iNTT are faster by 5.2% and 5.7%. Switching to 
a smaller modulus results in speed-up of 33.1%-37.6% for the relevant 
operations (sum of the base multiplication and iNTT) in the signing 
procedure. For Kyber, the optimizations results in 15.9%—-17.8% faster 
matrix-vector product which is a core arithmetic operation in Kyber. 


Keywords: Dilithium - Kyber - NIST PQC - Fermat Number 
Transform - Number-Theoretic Transform - Arm Cortex-M4 


1 Introduction 


Lattice-based cryptography appears to be the most promising family of post- 
quantum replacements needed for public-key cryptography broken by Shor’s 
algorithm [Sho94]. As lattice-based key encapsulation schemes and digital signa- 
tures provide reasonable key, ciphertext, and signature sizes and have particu- 
larly good performance on a variety of platforms, they are expected to be stan- 
dardized soon. One of such standardization efforts is the NIST PQC [Nat] project 
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aiming to find replacements for NIST’s standards for key establishment and dig- 
ital signatures as early as 2024. NIST PQC is nearing the end of its third round 
with announcements due in early 2022. Among the third round finalists in the 
competitions are 5 lattice-based schemes including the three key-encapsulation 
mechanisms (KEMs) Kyber, NTRU, and Saber as well as the digital signature 
schemes Dilithium, and Falcon. As there are only two other finalists (Classic 
McEliece and Rainbow) that are not lattice-based, which both have excessively 
large keys, it appears very likely that some of the lattice-based schemes are going 
to be selected for standardization unless there are cryptanalytic breakthroughs. 

Lattice-based cryptography is particularly suitable for microcontrollers as 
the key material is still of manageable size and computational performance is 
particularly fast with encapsulation and decapsulation in a few milliseconds while 
signing and verification times in the tens to hundreds of milliseconds. NIST 
has designated the Arm Cortex-M4 as the primary microcontroller optimization 
target for NIST PQC, and, hence, it has received the most attention so far. 

It appears that the number-theoretic transforms are cores of all high-speed 
implementations of lattice-based crypto for the Cortex-M4. It is either pre- 
scribed in the specification of Dilithium, Falcon, and Kyber, or maintains to be 
the fastest polynomial multiplication methods in Saber, NTRU [CHK+21], and 
NTRU Prime [ACC+20]. 

In this work, we focus on Kyber and Dilithium on the Cortex-M4. They are 
both part of the “Cryptographic Suite for Algebraic Lattices (CRYSTALS)” 
and are both designed to benefit from the NTT. We show that even though 
implementations have been improving for many years, we can still significantly 
improve the involved arithmetic. 


Contributions. The contribution of this work is threefold. Firstly, we apply 
various known techniques from work on the Cortex-M4 optimizing Saber, NTRU, 
and NTRU Prime. While the techniques are already known, they have so far not 
been applied to Kyber and Dilithium. This includes (1) the use of Cooley—Tukey 
butterflies for the inverse NTT of both Kyber and Dilithium previously proposed 
for Saber in [ACC+21]; (2) the use of floating point registers for caching values in 
the NTT of Dilithium and Kyber which was first proposed in the context of NTTs 
for NTRU Prime in [ACC+20]. This allows to merge more layers of the NTT 
and reduce memory access time for loading twiddle factors; (3) we make use of 
the “asymmetric multiplication” proposed in [BHK+21] which eliminates some 
duplicate computation in the base multiplication of Kyber at the cost of extra 
stack usage; and (4) we use an idea from [CHK+21] to improve the accumulation 
in the matrix-vector product of Kyber by using a 32-bit accumulator allowing to 
eliminate some modular reductions at the cost of more stack usage. 

Secondly, we present a faster Cortex-M4 instruction sequence to implement 
a signed Barrett reduction on packed 16-bit values applicable to the Kyber NTT. 
This immediately improves the Barrett reduction code proposed in [BKS19] from 
8 cycles to 6 cycles per packed reduction. 

Thirdly, we propose to use a different implementation for computing the 
product cs; as well as csg in Dilithium. Since both c and s;/sg have very small 
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absolute values, we can switch to a much smaller modulus q’ that allows effi- 
cient computation of the product. For Dilithium2 and Dilithium5, we make use 
of the Fermat prime q’ = 257, which allows using a particularly fast variant of 
the NTT called the Fermat number transform (FNT), similar to [LMPR08] for 
SWIFF'T. Furthermore, [LMPR08] implements FNT on an Intel processor while 
we implement FNT on the Cortex-M4 and make use of its barrel shifter. For 
Dilithium3 the FNT does not work as sı and s2 have larger values. We instead 
use an incomplete NTT with q’ = 769 which is still much faster than computing 
it modulo the original Dilithium prime. To best of our knowledge, we are the first 
to propose using a smaller modulus for these multiplications within Dilithium. 


Code. Our code is open-source and available at https://github.com/Faster 
KyberDilithiumM4/FasterKyberDilithiumM4. We will publish the code along- 
side the paper under a CCO copyright waiver. 


Structure. Section 2 recalls the preliminaries regarding Kyber, Dilithium, and 
the Cortex-M4. In Sect.3 and Sect. 4, we describe the optimizations applied to 
Kyber and Dilithium, respectively. Lastly, in Sect.5, we present the performance 
results and compare them to previous work. 


2 Preliminaries 


This section introduces the cryptographic schemes Kyber and Dilithium, which 
are both part of the Cryptographic Suite for Algebraic Lattices (CRYSTALS). 
Furthermore we give a brief introduction into the polynomial multiplication using 
the NTT, revisit the Barrett reduction and present relevant details considering 
our target platform, the Arm Cortex-M4. 


2.1 Notation 


For a prime q and a power of two n, we denote the polynomial ring Z,|X]/(X"+ 
1) by Rq. An element a € Ry is represented by a coefficient vector a; € Zq, such 
that a = >. a;X*. We denote polynomials using lower-case letters (e.g., a), 
vectors of polynomials using lower-case boldfaced letters (e.g., a), and matrices 
of polynomials using upper-case boldfaced letters (e.g., A). We symbolize poly- 
nomials, vectors, and matrices inside NTT-domain using a, a, and A, respectively. 

Following the definitions from [BDK+20, ABD+20], for an odd q we define 
the result of the central reduction r’ = rmod~*q as the unique element in 
Le, a) satisfying r’ = r mod q. Similarly, we define the result of r’ = r mod 
tq as the unique element in [0, q) satisfying r’ = r mod q. For scenarios in which 
the range of the reduction result does not matter, we write r’ = r mod q. 

The function sampleUniform(-) samples coefficients for polynomials, vectors 
of polynomials, or matrices of polynomials from a uniformly random distribution. 
In case a seed is given as the argument, the output is pseudorandomly generated 
from the seed. 
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2.2 Polynomial Multiplications Using the NTT 


The NTT is a variant of the discrete Fourier transform (DFT) defined over finite 
fields and is commonly used for efficient polynomial multiplications. The effi- 
ciency of this strategy is based on the fact that a polynomial multiplication 
inside NTT domain amounts to the coefficient-wise multiplication of the two poly- 
nomials. Specifically, the negacyclic NTT is used for multiplying polynomials in 
Zq|X]/(X" +1). 

Computing the negacyclic NTT can be viewed as the evaluation of a polyno- 
mial at powers of a primitive n-th root of unity ¢, for the polynomial ring R, 
with q prime. Additionally, multiplying all coefficients a; of a € Rg by powers 
of a 2n-th root of unity Go, = WCn is called “twisting” [Ber01]. 

This comes down to computing 


n-1 


n-1 
NTT(a) =4= 5 âX’ with @ = Y` a6}, 
1 rw 


for the forward transform (NTT) and 


n—-1 n—1 
iNTT(4) =a =$ aX with a, =n" SY yG 
i=0 j=0 


for the inverse transform (iNTT) [AB74]. The powers of the roots of unity used 
during the computation of the NTT are also frequently called “twiddle factors”. 

For computing the NTT itself efficiently, fast Fourier transform (FFT) algo- 
rithms, which only require O(n log n) operations, are commonly used. This algo- 
rithm was first described by Gauss in 1805 [Gau66] but it is also oftentimes cred- 
ited to Cooley and Tukey who published the same algorithm in 1965 [CT65]. The 
basic idea of the algorithm is to split the computation of a length n NTT into, most 
commonly, two separate number-theoretic transforms (NTTs) with an input size 
of n/2 each. Formally, we compute the isomorphism R4 > [], Zg[X]/(X — G,,) 
for i = 1,3,5,...,2 — 1 as given by the Chinese Remainder Theorem (CRT), 
as also explained in [BDK+20, Section 2.2]. For example, in the first instance 


we map Z,[X]/(X" + 1) to Zq[X]/(X"/? — Gil?) x ZqlX]/(X"/? + Gi). This 
splitting is usually repeated for log, n iterations, called “NTT layers”, where the 
results of the i-th layer are the remainders of polynomials a mod (X ae 4 ee 
for some j. Computing these remainders involves n/2 so-called butterfly opera- 
tions per layer. The CooleyTukey (CT) butterfly, consisting of one addition, one 
subtraction, and one multiplication in Z,, is depicted in Fig. la. 

While the CT algorithm is frequently used for computing the NTT, the 
Gentleman-—Sande FFT algorithm is commonly deployed for computing the iNTT. 
In contrast to this, we use the CT algorithm for the computation of the NTT and 
its inverse. A depiction of the GS butterfly is Fig. 1b. 

Using this method, the product of f,g € Rq can be efficiently computed as 
iNTT(NTT(f) oNTT(g)), where o indicates the base multiplication of two polyno- 
mials. In case the NTT is computed on logn layers, base multiplication is equal to 
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a > a+ Qb a >a+b 
b x >a—Cb b © > z(a b) 
¢ ¢ 
(a) Cooley—Tukey butterfly (b) Gentleman—Sande butterfly 


Fig. 1. NTT butterfly operations 


coefficient-wise multiplication requiring only n multiplications. In case the NTT is 
computed on l < log n layers, yielding 2! polynomials mod 2” —w form = ar and 
w a power of a root of unity, it is called an “incomplete” NTT. For this scenario, 
the base multiplication corresponds to pairwise m x m schoolbook multiplica- 
tions. This idea was initially introduced in [LS19] for the case of the modulus not 
supporting an NTT on logn layers, but is also applied for performance reasons in 
several other implementations, for example, [ABCG20, CHK+21, ACC+21]. 


2.3 Fermat Number Transform 


The Fermat number transform (FNT) is a special case of NTT in that the 
modulus is a Fermat number F, := 22 + 1. It was introduced in [SS71] for 
large integer multiplications and in [AB74,AB75] for digital convolutions. In 
this paper, we implement FNT for negacyclic convolution. For arbitrary F; as 
the modulus, cyclic transformations of sizes dividing 2°t? are supported [AB74, 
AB75]. For computing a negacyclic transformation of size n = 2°*! and Co, = 
V2, the first split becomes 


-1 


Zr,[X]/(X"— 27) =Zp,[X]/(X? — 27") x Zr [X] (X? +2) 
ST (XRF eee xe ao), 


After applying t layers, all of the polynomial rings are of the form Zp, [x] /(X 2? — 
21) where j is an odd number. Since (3, = 2, we can apply one more split. 
Furthermore, if F; is a prime, then we can compute cyclic transformations of 
sizes up to 2” = F,—1 and negacyclic transformations up to 2?'-1. Since the 
twiddles in initial t layers are powers of two, we can multiply with the twiddles 
using shift operations which is much cheaper than explicit multiplications on 
many platforms. Note that the only known prime Fermat numbers are Fo = 3, 
F, = 5, Fo = 17, F3 = 257, Fy = 65537. Out of those, only F3 and F4 appear 
promising for the use in Dilithium. They allow to compute 3 or 4 layers using 
only shifts. 
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2.4 Kyber 


Kyber [ABD+20] is an IND-CCA2-secure lattice-based key-encapsulation mech- 
anism(KEM) constructed from an IND-CPA secure public-key encryption 
scheme Kyber.CPAPKE using a variant of the Fujisaki-Okamoto (FO) trans- 
form [FO99]. The security of the scheme is based on the hardness of the 
module-learning with errors (MLWE) problem, a trade-off between the ring- 
learning with errors(RLWE) problem and learning with errors (LWE) prob- 
lem [ABD+20, Section 1.5]. Kyber is one of four round-three KEM-finalists in 
the NIST PQC [Nat] next to Saber [DKRV20], NTRU [CDH+20], and Classic 
McEliece [ABC+20]. 


Parameters. Kyber uses q = 3329 as its prime and n is chosen to be 256. 
Thus, it operates on Rg = Z3329[X]/(X?°° + 1) [ABD+20, Section 1.4]. The 
specification defines three different security levels of Kyber, namely Kyber-512 
(k = 2,m = 3), Kyber-768 (k = 3, = 2), and Kyber-1024 (k = 4,m, = 
2) [ABD+20, Section 1.4]. Due to the fact that q and n remain constant across 
the three parameter sets, almost all possible optimizations apply to all variants. 
For the specification of Kyber, we refer to [ABD+20] and omit the description. 


Number Theoretic Transform. Since polynomial multiplication is among 
the most costly operations for Kyber, the polynomial ring has been chosen, such 
that Kyber can profit from efficient polynomial multiplication using the NTT. 

For q = 3329, as deployed in Kyber, no primitive 512-th but only primitive 
256-th roots of unity exist for Rq with the first one being ¢, = 17 [ABD+20]. 
This means that the defining polynomial of R, (X7°° + 1) factors into 128 
polynomials of degree one and not into 256 polynomials of degree zero. Therefore, 
the result of the NTT of f € Rg is a vector of 128 polynomials of degree one. 
Thus, in contrast to Sect. 2.2, the coefficients â; inside NTT domain are given by 

127 127 
do; = 2 ag, C OE, and @a;41 = Y aya Ons 
j=0 j=0 
as defined in [ABD+20]. The function bry computes the bit reversal of a 7-bit 
integer on its argument. 

The absence of a primitive 512-th root of unity also has an impact on the 
base multiplication of two polynomials inside NTT domain: Instead of coefficient- 
wise multiplication, we need to perform schoolbook multiplications of size 2 x 2, 
i.e., we need to compute 128 products mod (X? — A) [ABD+20]. 


2.5 Dilithium 


Dilithium [DKL+18, BDK+20] is a lattice-based digital signature scheme based 
on the “Fiat-Shamir with Aborts” approach [Lyu09]. Its security is based on the 
hardness of the modular short integer solution (MSIS) and MLWE problems and 
it is currently among the three signature-finalists in the NIST PQC project [Nat], 
next to Falcon [FHK+20] and Rainbow [CDK+20]. 
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Parameters. Dilithium deploys the prime q = 8380417 = 2? — 213 + 1 and 
operates on the polynomial ring Ra = Z,|X]/(X" + 1) with n = 256. The two 
parameters q and n are the same across all parameter sets. 

Dilithium offers three different parameter sets, namely Dilithium2, Dilithium3, 
and Dilithium5, which target the three NIST security levels 2, 3, and 5. More 
details on the differences between the three parameter sets can be obtained from 
Table 1. The matrix dimension is given by (k,l), the bounds for sampling the 
secret key by 7, the number of +1 in the challenge polynomial c is 7, and #reps 
refers to the expected number of repetitions during the rejection sampling in the 
signature generation process [BDK-+20]. The parameters yı and y2 define the 
range for the coefficient y and the low-order rounding range [BDK+20]. 


Table 1. Overview of Dilithium’s parameter sets [BDK+20] 


Scheme NIST level (k,l) 7 T % y2 #reps |pk| | sig | 
Dilithium2 2 (4,4) 2 39 2" (q—1)/88 4.25 1312B 2420B 
Dilithium3 3 (6,5) 4 49 2° (q—1)/32 5.1 1952B 3293B 
Dilithium5 5 (8,7) 2 60 27° (q—1)/32 3.85 2592B 4595B 


We refer to [BDK+20] for the specification of Dilithium and omit the descrip- 
tion. 


Number Theoretic Transform. Since the main algebraic operations used 
by Dilithium are polynomial multiplications, Dilithium’s ring was chosen in such 
a way that the NTT can be applied [BDK+20]. In contrast to Kyber, for the 
Dilithium ring, a 2n-th primitive root of unity r = 1753 exists |BDK+20] and 
thus it is possible to compute a complete NTT with eight layers as described in 
Sect. 2.2. This allows for base multiplication by coefficient-wise multiplication. 


2.6 Barrett Reduction 


The Barrett reduction [Bar87] is an efficient algorithm for reductions in Zg. 
Besides its performance, one advantage is that it can be easily implemented in 
constant-time. A variant of the Barrett reduction that operates on signed integers 
has been presented in [Seil8, Algorithm 5] which has also been deployed in a 
previous implementation of Kyber |ABCG20]. Algorithm 2.1 is an illustration. 


Algorithm 2.1: Signed Barrett Reduction [ABCG20] 


Input :q withO<q< "ali and a with -£8 <a<& 
Output: r with r=a(modq),0<r<q 


Loe ew > precomputed 
ate late > signed high product and arithmetic right shift 
3 t — tq mod 8 > signed low product 


4 returnr<a-—t 
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2.7 Arm Cortex-M4 


The target platform for our implementation is the Arm Cortex-M4(F), which 
is a NIST-recommended evaluation platform for the candidates of the NIST 
PQC project. The Arm Cortex-M4 is based on the Armv7E-M instruction set 
architecture with 14 usable 32-bit general purpose registers. Additionally, on 
the Cortex-M4F, there are 32 single-precision floating-point registers [ARM11]. 

The instruction set also provides a number of powerful digital signal pro- 
cessing (DSP) instructions which allow to perform arithmetic operation on 
two half words or four bytes at the same time and have proven themselves 
to be beneficial in numerous implementations [BKS19, ABCG20, KMSRV18] 
of Kyber [ABD+20], and Saber [DKRV20]. In particular, the instructions 
smul{b,t}{b,t} multiply specific halfwords and smla{b,t}{b,t} multiply spe- 
cific halfwords and accumulate the product to the specified accumulator. Addi- 
tionally, the instructions smuad{,x} perform two halfword-multiplications and 
add up their products, while smlad{,x} perform two halfword-multiplications 
and add up their products which is then added to an accumulator. All of these 
instructions take one cycle to execute. Moreover, the Cortex-M4 can compute 
the 64-bit product of two 32-bit values (optionally, with accumulation) in a sin- 
gle cycle. Furthermore, the Cortex-M4 provides a barrel shifter for shifting or 
rotating the second operand for certain instructions with no additional cost. 

On the Cortex-M4, store instructions always take a single cycle, while a 
sequence of independent loads takes n + 1 cycles. Using the vldm instruction, it 
is possible to directly load data from the memory into the floating point registers. 
This also consumes n + 1 cycles for n data words. 


3 Improvements to Kyber Implementations 


For Kyber, we propose several optimizations for implementing NTT and iNTT and 
some speed optimizations to the matrix-vector product at the cost of a higher 
stack usage. We provide one implementation with all optimizations and one with 
only the optimizations that do not impact the stack usage. 

We base our implementations on [ABCG20] and the implementation in the 
pqm4 [KRSS19] project. In the following we focus on our contributions and omit 
details of the numerous optimizations present in previous implementations. 


3.1 NTT 


Caching in FPU Registers. For Kyber, on the layers 7—4, 15 twiddle factors 
are required and re-used multiple times throughout the iterations. By using the 
floating-point registers for caching the twiddle factors, the number of cycles for 
memory loads are reduced. This technique has been proven to be beneficial in 
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past work [ACC+20, CHK+21, ACC+21]. In our implementations, we load the 
15 twiddle factors (packed into eight registers) into the floating-point registers 
once with vldm instruction in nine cycles. Then, in each iteration the twiddle 
factors are fetched from the floating-point registers with vmov in a single cycle 
each. 

On the three remaining layers, it is not beneficial to make use of the floating 
point registers because in each of the 16 iterations at least one unique twiddle 
factor per layer is required, meaning none of the twiddle factors are re-used. 


Better Layer Merging. In our implementations we make use of the common 
optimization strategy of merging layers of the NTT computation [GOPS13]. The 
idea behind this strategy is to load multiple coefficients at once such that more 
than one layer of NTT can be computed at a time. This reduces the number of 
memory operations required at the cost of taking up more registers. The state- 
of-the-art implementation of Kyber |ABCG20] also deploys this strategy merging 
layers 7-5 and 4—2 while computing layer 1 separately. 

By making use of the floating point registers, we instead implement the NTT 
by merging layers 7-4 and 3-1. Layers 7—4 can be merged by first computing 
three layers of NTT on each (a1, a3, @5,@7, 9, Q11, 413, 415) and (ao, a2, a4, 6, ag, 
419,412,414) and then combining their results. First, the NTT on (a1, a3,...,@15) 
is computed and each of the layer 5 outputs is multiplied by the correspond- 
ing twiddle factors of the fourth layer. Then, (a1, a3,...,@15) are moved to the 
floating point registers for later use. After that, the polynomials (ao, a2,..., a14) 
are loaded and the NTT is computed on them. Finally, we vmov (do, a2,..., 14) 
one at a time and compute the final add-sub. In summary, this requires 128 
additional vmovs, whereas a separate layer requires 128 loads and 128 stores. 


3.2 Inverse NTT 


The most significant change we apply to the inverse NTT is the switch from 
Gentleman-—Sande butterflies to Cooley—Tukey butterflies. Therefore, all of the 
optimizations mentioned in the context of the NTT also apply to the inverse NTT. 


Switch to CT-Butterflies. In previous implementations of Kyber for the 
Arm Cortex-M4, the NTT was always implemented using CT butterflies, while 
the inverse NTT was implemented using GS butterflies, which is a commonly 
seen pattern for implementations using the NTT in general. Opposed to that, we 
implement the inverse NTT using CT butterflies in order to avoid the necessity 
of intermediate modular reductions by limiting the coefficients’ growths, as for 
example suggested in [Seil8, Section 2.1] or implemented for Saber in [ACC+21]. 
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Algorithm 3.1: Packed Barrett Algorithm 3.2: Improved Packed 


Reduction [BKS19] Barrett Reduction 
Input : a= (a || a») Input : a= (a: || a») H 
Output: c= (cs || c») mod *q Output: c= (c: || ch) mod “q 


32 
smlawb to, —| 7], a, 2" 
smlabt to,q,to,a 

32 
smlawt tn-l% l,a, 2" 
smulbt tı,q,tı 
add tı,a, tı, 1s1 #16 
pkhbt c,to,tı,1s1 #16 


1 smulbb to, a, |2] 
226 
smultb tı, a, || 
asr to, to, #26 
asr ti, tı, #26 
smulbb to, to, q 
smulbb t1,ti,q 
pkhbt to,to,t1,1sl1 #16 
usubié6 r,a, to 


oOarnran#bk w wb 
ana k 6 NY 


Using CT butterflies for the inverse NTT requires to do additional twisting 
during the computation of the last layer but the total number of multiplications 
does generally not increase because multiplications in the same amount can be 
omitted during the butterfly operations (“light butterflies”). One side effect of 
this approach is that some coefficients will grow larger than in the forward NTT 
because the multiplications in the butterflies always include reductions and now 
the operands of the addition and subtraction in the butterfly are not always 
limited by this. To counteract, we insert two modular multiplications on the 
fourth layer to limit the growth of the coefficients to be in (—9q, 9q), at most 
after the fourth layer. By detailed range analysis, we found that on the last three 
layers we need 20 additional reductions on packed arguments in total. 

Moreover, the Montgomery multiplication during the twisting removes the 
need of a separate Barrett reduction of every coefficient at the end of the last 
layer. This saves 256 Barrett reductions. 

Note that due to the new structure of the iNTT the input coefficients’ absolute 
values need to be smaller than q. 


3.3 Faster Barrett Reduction 


Similar to previous implementations, we deploy the Barrett reduction to reduce 
the coefficients. The Barrett reduction of two 16-bit integers packed in one 32-bit 
register has been previously implemented [BKS19] as shown in Algorithm 3.1. 
Using the smlaw{b,t} instructions as in Algorithm 3.2, the cycle count of one 
Barrett reduction is reduced by one. This means for reducing a packed argument, 
two cycles are saved. In contrast to the implementation from Algorithm 3.1, the 
technique presented in Algorithm 3.2 requires two Barrett constants which are 
both different from the previous one. Moreover, using this optimization removes 
the parantes of the reduction’s result being in [0,q), instead it will result in 
je 


—+ >|, 4] for an odd q. Therefore, its output must not be passed to one of the 


packing or compression functions because they assume the input to be in [0, q). 
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This means, it may not be used in the poly_reduce function but it can be used 
inside the NTT and iNTT. 


3.4 Matrix-Vector Product 


For speed optimization of the matrix-vector product, we implement two tech- 
niques. Both of them require additional stack space and therefore, if a low mem- 
ory footprint is a concern, the applicability needs to be checked. Further, we 
re-implement the C function for the computation of the matrix-vector product 
in assembly which allows us to significantly lower the number of function calls 
required by efficiently using the registers and making use of macros. We proceed 
similarly for the inner product in the decryption. 


Asymmetric Multiplication. For the computation of the matrix-vector prod- 
uct As in Kyber, we compute iNTT(A oNTT(s)). During this computation, every 
row of A needs to be multiplied by s. Therefore it is a common strategy to cache 
the result of § instead of recomputing it for every row of A [BKS19]. Using a 
trick for integer multiplication presented in [BDL+11,BHK+21] extended the 
aforementioned concept for which incomplete NTTs are deployed. 

Recall that the Kyber NIT is incomplete, i.e., 7 instead of 8 layers are 
computed, and therefore the product of two polynomials inside NTT-domain 
a@os = ĉ consists of 128 2 x 2 schoolbook multiplications. For computing 
C25 + ĉ2i41 X = (G2; + Gioi41X ) (894 + ŝ2i41X) mod (X? = Crt), we have 
Ca; = û2i2i + Goin 1 2i C77 Ft and Coj41 = Gi S2i41 + §2i4ai41.- 

The idea behind the proposal from [BHK+21, Section 4.2] is that during the 
computation of Ao S, each polynomial of § is used k times which means that 
the computation of Bog is repeated k times. This can be avoided by 
caching the intermediate results of §2;41¢ 2br7(2)+1 in a separate vector §’. 

We implement two separate variants for the base multiplication, one of which 
is only used for the first row of the matrix in the matrix-vector product, while the 
other one is used for all of the following ones. The first variant computes the same 
base multiplication as before except that it stores the result of §9;41¢ 2br7(i)+1 sep- 
arately. This comes at the cost of two additional stores and one additional load 
from the stack for the argument containing the address of s’ per two polynomial 
multiplications. The second variant saves two smultb instructions, two mont- 
gomery reductions, and the load of one twiddle factor per two polynomials by 
loading the cached values instead. The precomputed vector can also be re-used 
in the inner product following the matrix-vector multiplication in encryption. 


Better Accumulation. We also make use of an improved accumulation strat- 
egy in the matrix-vector product as presented in [CHK+21]. For the computation 
of one element of the output vector in a matrix-vector product, a total number 
of k base multiplications as well as k — 1 accumulating additions are required. 
Instead of reducing each coefficient directly after the base multiplication before 
accumulating, we delay this step until all three base multiplication results have 
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been accumulated. We also implement this technique for the computation of the 
inner product. For the implementation, we define three variants of the caching 
and non-caching base multiplication functions each: One that takes 16-bit input 
values and writes to a 32-bit output array, one that takes unreduced 32-bit input 
values and writes to a 32-bit output array, as well as one function that also takes 
unreduced 32-bit input values but outputs reduced and packed coefficients in 
a 16-bit integer array. For the second type of the function, the operation on 
32-bit values also allows for usage of smla{b,t} instead of smul{b,t} such that 
no extra addition is required for the accumulation, compared to the case when 
computing on packed 16-bit coefficients. 

Due to the small size of the Kyber prime, the sum will never overflow a signed 
32-bit integer: For the matrix-vector products in Kyber using asymmetric multi- 
plication, possible vector-inputs are the output of an NTT which is in [-, a) 
or the cached Montgomery multiplication result from the asymmetric multipli- 
cation which is in (—q,q). The coefficients of the matrix generated using the 
on-the-fly approach from [BKS19] are smaller than q. Therefore, the maximum 
result for one of the multiplications is € (—q?,q?). For k accumulations with 
k € {2,3,4}, we get a maximum absolute intermediate value of kq? = 4q? < 221. 


4 Improvements to Dilithium Implementations 


For Dilithium we deploy similar strategies for optimizing the NTT and iNTT as for 
Kyber and optimize the multiplication of c and s4, as well as c and ss. 


4.1 NTT and Inverse NTT 


For the NTT, we merge the layers as 7—5, 4-2, 1-0 to reduce the number of mem- 
ory operations. This differs from the previous implementation [GKS20, GKOS18] 
where layers 7—6, 5—4, 3-2, and 1-0 are merged. For the iNTT, we similarly switch 
to CT-butterflies and merge as in the NTT. 


Switch to CT-Butterflies. Just as for Kyber, we switch to CT butterflies for 
the computation of the iNTT. Further, we make use of a technique introduced 
in [ACC+21, Appendix D| which computes light butterflies with one less reduc- 
tion. As opposed to the Kyber, the coefficients’ growth due to the light butterflies 
is not of concern for the Dilithium since values up to 256q fit in a 32-bit register. 


4.2 Small NTTs for Dilithium 


In the signature generation of Dilithium, we recall that the polynomial c consists 
of 7 +1’s and 256 — 7 0’s, and all polynomials in sı and s2 consist of elements 
in [—7, 7]. The absolute values of the coefficients in cs; and csg are bounded 
by Tn, and the computation can be regarded as in Zy for q! > 27m [CHK+21, 
Section 2.4.6]. As far as we know, all implementations choose q’ = 8380417 and 


Faster Kyber and Dilithium on the Cortex-M4 865 


employ the NTT defined for Dilithium. However, since only the correct cs; and 
CS2 are required, there is some freedom for choosing q’. The parameters T -n 
are 39-2 = 78 for, 49-4 = 196 for Dilithium3, and 60-2 = 120 for Dilithium5. 
Consequently, we choose the Fermat number q/ = F; = 257 for Dilithium2 and 
Dilithium5, and q’ = 769 for Dilithium3. Alternatively, one can also re-use the 
Kyber prime q’ = 3329 for any of the parameters in case re-using the code is of 
interest. We have also experimented with the Fermat number q’ = F4 = 65537 for 
Dilithium3. However, this did not result in in a speed-up compared to q’ = 769. 


FNT for Dilithium2 and Dilithium5. For q = 257 = 2° + 1, we have FNT 
defined over Zo57[X]/(X7°®+1). We implement the forward transformation with 
7 layers of CT butterflies. Since the input coefficients for c, s1, and sg are at 
most in [—7,7], we only need very few reductions. Recall that a CT butterfly 
maps (a,b) to (a + wb,a — wb), we can implement it with mla and mls. Fur- 
thermore, we can also take a closer look at the initial layers. Since —1 = 28 
(mod 257), the first layer can be written as Zo57[X]/(X?7°° + 1) S Zos7[X]/ 
(X18 — 24) x Zo57[X]/(X178 + 24) and the corresponding CT butterfly maps 
(a,b) to (a+ 24b, a—2*b). We denote such computation as CT_FNT(a, b, 4). Notice 
that without loading twiddle factors, we can implement CT_FNT(a, b, logW) effi- 
ciently with barrel shifter as illustrated in Algorithm 4.1. 

Let iFNT be the inverse of FNT. We first observe that the inverse of 2} can be 
written as 2~* = 216-k = _28-* (mod 28 + 1). There are two places where we 
need to multiply by an inverse of a power of two: (i) the inverses corresponded 
to the butterflies with w = 2!°&” in CT_FNT, and (ii) the scaling by 1287! at 
the end of iFNT. We denote CT_iFNT(a, b, logW) as the function mapping (a, b) 
to (a — 21°8Wb, a + 21°8Wp) = (a + 28+108Wb q — 28+1°8"5) and implement it with 
barrel shifter as shown in Algorithm 4.2. Clearly, if CT_FNT(a,b,k) computes 
(a+ 2*b, a — 2*b), then CT_iFNT(a, b,8 — k) computes (a+ 27*b, a —27*b) which 
can be used in iFNT. We compute iFNT with four layers of GS butterflies followed 
by three layers of CT butterflies. During the GS butterflies, since the twiddle 
factors are also very small, we can replace some of the mul, add, and sub with 
mla and mls. For CT butterflies, since the twiddle factors are powers of two, 
we implement them with Algorithm 4.2. Lastly, at the end of CT butterflies, we 
merge the twisting by powers of two with the multiplication by 12871. 


NTT over 769 for Dilithium3. For Dilithium3, since the maximum absolute 
value of cs; and cSg is bounded by TN = 4-49 = 196, we cannot use q’ = 257 < 
2-196. We therefore choose q’ = 769 and modify the NTT and iNTT from Kyber. 
Except for discarding most of the Barrett reductions, the code is the same. 

Recall that for the NTT in Kyber, we require the output to be in es 1] for 
the secret key. However, for Dilithium3, since we are only using 16-bit NTT for 
computing cs; and cs2, we can remove the Barrett reductions at the end and 
allow elements growing up to 7q’ in absolute value. 
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Algorithm 4.1: CT_FNT(a, b, logW). Algorithm 4.2: CT_iFNT(a, b, logW). 


Input : (a,b) = (a,b) Input : (a,b) = (a,b) 
Output: (a,b) = Output: (a,b) = 
(a + 2°°8"b, a — 2°°8"b) (a — 2°°8"b, a + 2°°8"5) 
1 adda, a,b, lsl #logW 1 sub a, a, b, lsl #logW 
2 sub b, a, b, 1lsl #(logW+1) 2 add b, a, b, 1sl #(logW+1) 


For the iNTT, replacing with q’ = 769 allows us to postpone the Barrett 
reductions by one layer and reduce the number of Barrett reductions by half. At 
the end of iNTT, we replace the 16-bit Montgomery multiplication with straight 
multiplication and 32-bit Barrett reduction. By using 32-bit Barrett reduction, 
the result is within [—384, 384] if the product is in [—113025697, 113025697]. 
Since logo ( 48928697) ~ 18.17, we derive values in [—384, 384] by applying 32-bit 
Barrett reduction to the product of any signed 16-bit value and any constant 
from [—384, 384]. The downside for using 32-bit Barrett reduction is a slightly 
higher register pressure, but overall it is more favorable because we don’t need 
to reduce them again. This is different from the 16-bit NTT in [ACC+21]. They 
implemented the twist with Montgomery multiplication and then reduced the 
result to [—384, 384] with an additional 32-bit Barrett reduction. 


5 Results 


In this section, we present the implementations results of Kyber and Dilithium. 


5.1 Benchmarking Setup 


Our concrete hardware target is the STM32F4DISCOVERY with the STM32- 
F407VG MCU, which also is the target of previous publications concerning imple- 
mentations of post-quantum schemes on microcontrollers. It comes with 1 MiB 
of flash memory, and 192 KiB of RAM. 

Our benchmarking setup is based on pqm4 [KRSS19]. During the bench- 
marks, we clock the microcontroller at 24 MHz in order to avoid wait states 
during memory operations. We compile the code using arm-none-eabi-gcc ver- 
sion 10.2.1 with the -03 option. Regarding the Keccak implementation, we make 
use of the code provided in pqm4. For the randomness generation we rely on the 
microcontroller’s hardware random number generator (RNG). 

We compare our Kyber implementations to the code currently present in 
pqm4 which is based on the work in [ABCG20] and [BKS19]. Similarly, we com- 
pare our implementations of Dilithium (2 and 3) to the code in pqm4 which is 
based on [GKS20]. For Dilithium5, pqm4 does not currently have an implemen- 
tation due to a lack of stack space. We apply some of the stack optimizations of 
[GKS20] to our implementations, especially to make Dilithium5 work as well. It 
is important to note that the parameters of Kyber and Dilithium were changed 
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at the start of the third round of the NISTPQC competition. The numbers 
presented here reflect the round 3 versions contained in pqm4. Those are opti- 
mizations from the original papers ported to the third round parameters. The 
performance results for the full schemes do not match the original publications. 


5.2 Performance of NTT-Related Functions 


In Table 2, we present the cycle counts for the transformations we deploy in our 
implementations of Kyber and Dilithium. For the Kyber NTT, we achieve a speedup 
of 12.6%. Regarding the Kyber iNTT, we obtain a speedup of up-to 21.3%. Note 
that for the stack-optimized variant an additional reduction is required before 
the iNTT because of the absence of asymmetric multiplication. 

We achieve a speedup of 5.2% for the Dilithium NTT, and 5.7% for the iNTT. 
For the small NTTs the metric we are optimizing is (k + l) - NTT + #reps - 
(NTT + (k + l) - (basemul + iNTT)). As most of the small NIT are computed 
outside of the loop, we moved some of the reductions into the NTT resulting in 
a faster basemul. Note that for q = 257 and q = 769 the NTT and iNTT have 
very close performance, but the basemul differs. This results in the FNT being 
advantageous for Dilithium2 and Dilithium5. For (basemul + iNTT), we achieve a 
speedup of 37.6% for q = 257, and 33.1% for q = 769 compared to q = 8380417 
from [GKS20]. We also compare our q = 769 implementation to an existing one 
by [ACC+21], because theoretically, their 6-layer approach could also be used 
as well. Since the computation is dominated by (basemul + iNTT), we find that 
our 7-layer approach is faster. We also carefully examine the code by [ACC+21], 
and find that the last 32-bit Barrett reduction is performed outside the reported 
iNTT, so the speedup is more. 

Table 3 contains the result for our benchmarks of the MVP and inner product 
(IP) functions as deployed in Kyber. For the MVP, we consider the MVP as it 
is computed in the key generation. The MVP in the encryption is similar but 
contains k NTTs less. Note that in the actual implementation of Kyber, the 
MVP is interleaved with the on-the-fly generation of the matrix. For ease of 
comparison, we additionally provide benchmarks for a stripped down variant 
of the MVP excluding the hashing. Regarding our benchmarks, we count the 
caching for the asymmetric multiplication towards the MVP although the IP for 
the encryption also benefits of this pre-computation. For the same reasons as for 
the MVP, the benchmarks of our IP functions only include the NTTs, the base 
multiplications, and deserialization, if applicable. For the speed optimized MVP 
implementation, we get speedups between 15.9% and 17.8% (excl. hashing). The 
stack optimized variant, achieves speedups between 12.1% and 12.5%. We achieve 
speedups of 26.9%-31.7% (enc) and 21.6%-23.3% (dec) for the speed optimized 
inner product, while for the stack variant we obtain speedups of 4%-6.3% and 
17.3%-18.9%, respectively. We observe that for larger k, the speed optimization 
strategy gives increasingly lower cycle counts due to asym. multiplication. 
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Table 2. Cycle counts for transformation operations of Kyber and Dilithium. NTT 
and iNTT correspond to the schemes default transformations, i.e., q = 3329 for Kyber 
and q = 8380417 for Dilithium. The NTT with q = 257 is deployed for Dilithium2 and 
Dilithium5, and the NTT with q = 769 is used used for Dilithium3. 


Prime Implementation NTT iNTT basemul 

Kyber q = 3329 [ABCG20} 6 852 6979 2317 
This work 5992 5491/6282" 1613? 

Dilithium q = 8380417 [GKS20] 8 540 8 923 1955 
This work 8 093 8415 1955 

q = 257 This work 5 524 5563 1225 

q = 769 [ACC+21] (6-layer) 4852 4817 2 966 

This work 5200 5537 1740 


* First value is for speed-optimization, second for stack-optimization. 

> Asymmetric basemul as used in the IP (enc). As the basemul in the MVP 
and IP consists of individual function calls, the cycle count is not straight 
forward to measure. 


Table 3. Cycle counts for matrix-vector and inner products used in Kyber. 


implementation variant operation Kyber-512 Kyber-768 Kyber-1024 
pqm4 Matrix-Vector Product®* 66 291 127 634 209 517 
Matrix-Vector Product» 226 580 484 077 840 498 
Inner Product (enc) 11978 14696 17 429 
Inner Product (dec) 29 888 41910 53 792 
This work speed Matrix-Vector Product? 55 746 106 380 172 152 
Matrix-Vector Product» 211606 457 213 796 349 
Inner Product (enc) 8 762 10331 11898 
Inner Product (dec) 23 425 32 354 41275 


stack Matrix-Vector Product® 58 028 112 503 184149 
Matrix-Vector Product? 214053 463 590 808 206 
Inner Product (enc) 11218 13 877 16 733 


Inner Product (dec) 24 722 34 167 43 619 
a Measurement excluding the hashing. 
b Measurement including the hashing. 


5.3 Performance of Schemes 


Per Table 4, we achieve speedups of 3.3%—4.2%, 3.1%-3.6%, and 5.1%-5.2% for 
the key generation, encapsulation, and decapsulation our speed optimized vari- 
ant. As to be expected due to the caching of intermediate values for speed 
optimizations, our speed implementation has a higher stack usage. Our stack 
implementations use essentially the same stack as previous work. 

Table 5 contains the results for Dilithium. We achieve consistent speedups for 
all parameter sets. The absolute savings due to our optimizations are clearly seen, 
particularly in signing. The speedup for signing ranges from 1.5% to 5.6%. In 
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Table 4. Cycle counts and stack usage for Kyber for the key generation, encapsulation, 


and decapsulation. Cycle counts are averaged over 100 executions. 


implementation variant Kyber-512 Kyber-768 Kyber-1024 
cc stack [B] cce stack [B] cc stack [B] 
pqm4, [ABCG20] K 458k 2220 745k 3100 1188k 3612 
E 553k 2308 899k 2780 3292 3 292 
D 513k 2324 839k 2804 1294k 3324 
This work speed K 443k 4272 718k 5312 1138k 6 336 
E 536k 5376 870k 6416 7432 7 432 
D 487k 5384 796k 6432 1227k 7 448 
stack K 444k 2220 724k 2736 1149k 3256 
E 540k 2308 879k 2808 3328 3 328 
D 492k 2324 807k 2824 1246k 3352 


relative terms, the impact of our optimizations on the full Kyber and Dilithium 
seem relatively small compared to the speedups we gain for the polynomial 
arithmetic. This is due to dominance of the hashing operations as thoroughly 


analyzed in previous work [KRSS19]. 


Table 5. Cycle counts and stack usage for Dilithium. K, S, and V correspond to the key 
generation, signature generation, and signature verification. Cycle counts are averaged 


over 10000 executions. 


implementation variant Dilithium2 
cc stack [B] 
pqm4, [GKS20] K 1602k 38k 
S 4336k 49k 
V 1579k 36k 
This work speed K 1596k 8 508 
S 4093k 49k 
V 1572k 36k 


Dilithium3 

cc stack [B] 
2 835k 61k 
6 721k 74k 
2 700k 58k 
2 827k 9540 
6 623k 69k 
2 692k 58k 


Dilithium5 

cc stack [B] 
4836k 98k 
9037k 115k 
4718k 93k 
4829k 11696 
8 803k 116k 
4707k 93k 
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